ABSTRACT

We formalise a model of array architectures suitable for VLSI implementation. A
formal model of an arbitrarily structured tree machine is also presented. A mathematical
framework is developed to transform cube graphs, which are data-flow descriptions of
certain matrix computations, onto the array and tree models. All published algorithms
for these computations can be obtained using the mathematical framework. In addition,
novel linear-array algorithms for matrix multiplication are obtained. More importantly,
the algorithms obtained for the tree model are of special significance. Besides their
novelty, the independence of the tree algorithms from a specific inter-processor
communication geometry makes them robust to hardware faults, as opposed to algorithms
that are based on specific interconnection requirements.

1. Introduction

Specialized array processors have been proposed as a means of handling compute-bound
problems in a cost-effective and efficient manner [4,5,6]. These array processors
are typically made up of simple, identical processing elements that operate in
synchrony. Several array structures have been proposed, including linear arrays,
rectangular arrays and hexagonal arrays. The simplicity and regularity of linear,
rectangular and hexagonal array processors render them suitable for VLSI implementation.
High performance is achieved by extensive use of pipelining and multiprocessing. In a
typical application, such arrays would be attached as peripheral devices to a host
computer which inserts input values into them and extracts output values from them.


A variety of algorithms have been designed for such arrays [1, 2, 5, 7, 10]. All these
algorithms exhibit the following feature. They are composed of streams of data travelling
in multiple directions at multiple speeds. Each processing element receives data from
each of the streams, performs some simple operation and pumps them out (possibly
updated). We will refer to such algorithms as "array algorithms". The array is
typically comprised of identical processors, that is, they all execute the same set of
instructions in every instruction cycle and they do not have any control unit. The array
is driven by a single-phase or two-phase global clock [9].

A few methodologies have been proposed for transforming high-level specifications
onto array algorithms [3,8,14]. Though these methodologies are imaginative, they lack a
mathematical basis. In this paper we outline a mathematical framework for transforming
data-flow descriptions of matrix and related computations onto array machines and tree
machines. The research presented herein is an extension of our work on the transformation
of high-level specifications onto linear-array algorithms [11]. In this paper we generalize
the formal model of a linear array developed in [11] to include two-dimensional arrays
(rectangular and hexagonal arrays). We also formalise another important model,
namely, tree machines having arbitrary structure. Algorithms on such tree machines are
particularly important. Any connected set of processors (that is, any two processors in
the set can communicate with each other either directly or indirectly through other
processors and communication links in the set) with no a priori topological restrictions
can be used to execute these algorithms by forming a spanning tree of the connected set of
processors. The connected set of processors could be the non-faulty processors in an
underlying host machine (like a rectangular or a hexagonal array) which has both faulty
and non-faulty processors and communication links. The hardware details of
reconfiguring such a connected component of processors are provided in [12,13].

This paper is organized as follows. In Section 2 we formalize the array and tree
machine models. We also introduce cube graphs, which are data-flow descriptions of some
matrix and related computations. In Section 3 we provide a precise formulation of
correctly transforming cube graphs (referred to as mapping cube graphs in our
terminology) onto the array and tree models. Algorithms for mapping cube graphs onto the
array and tree models are also presented in Section 3. In the Appendix we provide a
proof that the algorithms in Section 3 correctly transform a cube graph into array and
tree algorithms.

2.  Computational  Models 

We now formalize the array model, the tree model and cube graphs. We begin with
a formal definition of an array processor.

2.1. Array Machine Model

Let $I_1, I_2, \ldots, I_z$ be z sets of sequences of integers where each $I_r$ ranges
from 1 to $m_r$. Let $I \subseteq I_1 \times I_2 \times \cdots \times I_z$.

Definition 2.1: An array machine Ar is a 4-tuple
$\langle N_{Ar}, T_{Ar}, \delta_{Ar}, \psi_{Ar} \rangle$ where:

1. $N_{Ar}$ is a set of identical processors.

2. $T_{Ar} = \{l_1, l_2, \ldots, l_k\}$ is the set of labels.

3. $\delta_{Ar}: N_{Ar} \to I$ is a one-one function that assigns coordinates to every
processor in the Euclidean z-space.

4. Every processor in the array has k input ports and k output ports, with
each input port and output port assigned a unique label from $T_{Ar}$.

5. The array is driven either by a single-phase or a two-phase global clock. A
phase can be viewed as the instruction cycle of a processor. In a single-phase
clocking scheme all processors are activated in every phase and every
processor computes a k-ary function $\psi_{Ar}$. In a two-phase clocking scheme,
adjacent processors are activated during opposite phases of the clock and every
processor computes $\psi_{Ar}$ in the phase in which it is active.

The value of z and the communication geometry determine the structure of the
array processor. In this paper we will be examining three types of array processors,
namely, linear, mesh and hexagonal arrays, which are well-suited for VLSI
implementation [4,9]. We now formalize these three arrays. Our definition captures the
"nearest-neighbor" interconnection of these arrays and also the intuitive notion of a
data stream used earlier in the description of array algorithms. $\forall l_j \in T_{Ar}$,
let $n_{l_j}$ be the neighborhood constant associated with label $l_j$.

Definition 2.2: A linear array LAr is an array processor with $z = 1$, that is,
$I \subseteq I_1$. In addition, the linear array has the following communication features.
Let p be a processor index. Then, $\forall l_j \in T_{Ar}$, the output port labelled $l_j$
of p is connected to the input port labelled $l_j$ of $p + n_{l_j}$ where
$n_{l_j} \in \{1, -1, 0\}$.

Let $L_H$, $L_V$ and $L_O$ be three disjoint sets of labels such that
$L_H \cup L_V \cup L_O = T_{Ar}$.

Definition 2.3: A mesh array is an array processor with $z = 2$, that is,
$I \subseteq I_1 \times I_2$. In addition, the mesh array has the following communication
features. Let $\langle p,q \rangle$ denote the coordinate of any processor in the mesh
array. Then,

1. $\forall l_j \in L_O$, the output port labelled $l_j$ of $\langle p,q \rangle$ is
connected to its own input port labelled $l_j$, that is, $n_{l_j} = 0$.

2. $\forall l_j \in L_H$, the output port labelled $l_j$ is connected to the input port
labelled $l_j$ of $\langle p + n_{l_j}, q \rangle$ where $n_{l_j} \in \{1, -1\}$.

3. $\forall l_j \in L_V$, the output port labelled $l_j$ is connected to the input port
labelled $l_j$ of $\langle p, q + n_{l_j} \rangle$ where $n_{l_j} \in \{1, -1\}$.

Let $L_H$, $L_V$, $L_O$ and $L_T$ be four disjoint sets of labels such that
$L_H \cup L_V \cup L_O \cup L_T = T_{Ar}$.

Definition 2.4: Let $c \in \{1, -1\}$ denote the hexagonal array constant. A hexagonal
array is similar to a mesh array with the additional communication feature that
$\forall l_j \in L_T$, the output port labelled $l_j$ of $\langle p,q \rangle$ is connected
to the input port labelled $l_j$ of $\langle p + n_{l_j}, q + n_{l_j} c \rangle$ where
$n_{l_j} \in \{1, -1\}$.

Fig. 2.1, Fig. 2.2 and Fig. 2.3 illustrate linear, mesh and hexagonal array processors,
respectively. In the figures $I_1$, $I_2$ and $I_3$ denote external input ports and
$O_1$, $O_2$ and $O_3$ denote external output ports.

Figure 2.1    Figure 2.2    Figure 2.3


In Fig. 2.1, the links between processors directed from west to east are labelled $l_1$
and those directed from east to west are labelled $l_2$. The links connecting a processor
back to itself are labelled $l_3$. The neighborhood constants are
$n_{l_1} = 1$, $n_{l_2} = -1$ and $n_{l_3} = 0$.

In Fig. 2.2, the links directed from west to east are labelled $l_1$ and the links
directed from north to south are labelled $l_2$. $L_H = \{l_1\}$ and $L_V = \{l_2\}$ and
$n_{l_1} = n_{l_2}$.

In Fig. 2.3, the links pointing northeast are labelled $l_1$, the links pointing southeast
are labelled $l_2$ and the links directed from south to north are labelled $l_3$.
$L_H = \{l_1\}$, $L_V = \{l_2\}$ and $L_T = \{l_3\}$. $n_{l_1} = n_{l_2} = 1$ and
$n_{l_3} = -1$. The hexagonal array constant is $c = -1$.

We will refer to the processor whose input port labelled $l_j$ is connected to the
output port labelled $l_j$ of processor p as its neighbor with respect to label $l_j$. If a
processor q is the neighbor of p with respect to label $l_j$ then q can only receive data
from p on the link labelled $l_j$ connecting them. Similarly p can only send data to q on
the same link. The links connecting any two processors are unidirectional. Impose a
direction on the links such that the sender is at the tail end and the receiver at the
other end. A stream, then, is a directed path through processors and links having the
same label.


We model the speed of data in streams by associating a queue of buffers with the
communication links. More precisely, let s be a processor in the array. Let
$si_t = \langle si_t^1, si_t^2, \ldots, si_t^k \rangle$ denote the k-tuple input to
processor s at time t, where $si_t^j$ is the value at the input port labelled $l_j$ of
processor s at time t. Let $so_t = \langle so_t^1, so_t^2, \ldots, so_t^k \rangle$
denote the k-tuple output computed by processor s at time t, that is,
$\psi_{Ar}(si_t) = so_t$. Elements in a data stream travel at a constant velocity, and
hence a positive, non-zero delay constant $d_{l_j}$ is associated with every label $l_j$
in $T_{Ar}$ such that $so_t^j$ appears at the output port labelled $l_j$ of s at time
$t + d_{l_j}$. The delay $d_{l_j}$ can be implemented as a queue using a shift register of
length $d_{l_j} - 1$.
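
The queue view of a delay constant can be illustrated with a small simulation. The Python sketch below is only an illustration under our own assumptions: the class name `DelayLink` and its `step` method are hypothetical, and the link is modelled as the shift-register stages plus one output latch, so that a value handed to the link at time t is read at the other end at time t + d.

```python
from collections import deque


class DelayLink:
    """Hypothetical model of one labelled link with delay constant d >= 1."""

    def __init__(self, delay):
        if delay < 1:
            raise ValueError("the delay constant must be a positive integer")
        # delay - 1 shift-register stages plus one output latch, all empty at start
        self._stages = deque([None] * delay)

    def step(self, value):
        """Advance one clock cycle: push a new value, return the one leaving the link."""
        self._stages.append(value)
        return self._stages.popleft()


# Feed the values 0, 1, 2, ... into a link whose delay constant is 3.
link = DelayLink(delay=3)
outputs = [link.step(t) for t in range(6)]
print(outputs)
```

With a delay of 3, the value written at time 0 is the first one to emerge, at time 3; slower streams simply use longer queues.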

2.2. Tree Machine Model

We are now in a position to formalize a tree machine as follows.

Definition 2.5: A tree machine is a set of processors in a tree that are indexed by
some depth-first traversal of the tree. In addition, it has the following communication
features. Let p be a processor index. Then, $\forall l_j \in T_{Ar}$, the output port
labelled $l_j$ of p is connected to the input port labelled $l_j$ of $p + n_{l_j}$ where
$n_{l_j} \in \{1, -1, 0\}$. Moreover, for every label $l_j \in T_{Ar}$, and for every
communication link between the output port of a processor indexed p and the input port of
the processor indexed $p + n_{l_j}$, associate a delay
$\zeta(l_j, p) = d_{l_j} + \Delta(l_j, p)$ where $d_{l_j}$ is the delay constant associated
with any communication link labelled $l_j$, and $\Delta(l_j, p)$ is the perturbation delay
between processors p and $p + n_{l_j}$.

A tree machine is a generalization of the linear array model (see definition 2.2).
The term tree machine signifies that the interconnection between processors in the array
can be represented by an arbitrary tree, the vertices of which represent processors and
the edges of which represent the adjacency relation between the processors. The
corresponding representation for a linear array is the special case of a tree that forms a
path graph.

On any such tree of processors it is possible to simulate the data flow through a
linear array by routing the data streams through a closed path around the periphery of
the tree (see Fig. 2.4).


Figure 2.4 (closed path: abc ... hij)

The major difference between this "logical pipeline" in a tree machine and a "physical
pipeline" in the linear array model is that in the former, logically adjacent processors
(i.e., the pair indexed i and i+1) need not be physically adjacent. Since all the data
streams flow through the array at a finite velocity, the implication of this physical
separation is that the delay encountered by a data element in traversing the array from
processor i to i+1 (or vice versa) is a function of both the delay constant associated
with the stream to which that element belongs and the physical separation between
the processors.

Our tree machine model (definition 2.5) is motivated by this idea. The delay for a
data stream $l_j$ between processors indexed p and $p + n_{l_j}$ is represented by
$\zeta(l_j, p) = d_{l_j} + \Delta(l_j, p)$. The first quantity is the delay constant
associated with any link labelled $l_j$ and the second quantity is the perturbation in this
delay caused by the non-adjacent physical arrangement of the logically adjacent processors
p and $p + n_{l_j}$.

2.3.  Cube  Graphs 

We now provide a formal definition of the graphs that we will later map onto
linear, mesh and hexagonal arrays and tree machines.

Let $G = \langle V, E, L_G \rangle$ be a labelled DAG where:

1. $V = V_G \cup SO_G \cup SI_G$, and $V_G$, $SO_G$ and $SI_G$ are three disjoint sets of
vertices with $SO_G$ the set of source vertices, $SI_G$ the set of sink vertices and $V_G$
the set of remaining vertices, which we shall call computation vertices.

2. $L_G = \{l_1, l_2, l_3\}$ is a set of labels.

3. Every vertex in $V_G$ has three incident edges and three outgoing edges,
where each incident and outgoing edge is assigned a unique label from $L_G$.

In any execution of G on the array or the tree, every computation vertex in G is a
single instance of a function evaluation that is performed in a cycle by a processor in the
array or the tree. As all processors compute the same function, every computation
vertex also represents the same function.

We can view the k incoming edges to a computation vertex $v_x$ as representing the
k-tuple input value to the processor that evaluates $v_x$. Similarly, we can view the k
outgoing edges from $v_x$ as the k-tuple output value that is computed by the processor on
evaluating $v_x$. Throughout the rest of this paper we will adopt the terminology that a
source vertex represents an input value and a sink vertex represents an output value.

Let $J_1$, $J_2$ and $J_3$ be three sequences of integers ranging from 0 to $h_1$, 0 to
$h_2$ and 0 to $h_3$ respectively. Let $J \subseteq J_1 \times J_2 \times J_3$.

Definition 2.6: G is a Cube Graph iff there exists a one-one function $F: V_G \to J$ that
satisfies the following. Let $F_{l_1}$, $F_{l_2}$ and $F_{l_3}$ be three projection
functions of F, that is, if $F(v_x) = \langle c_1, c_2, c_3 \rangle$ then
$F_{l_1}(v_x) = c_1$, $F_{l_2}(v_x) = c_2$ and $F_{l_3}(v_x) = c_3$. Let $v_x$ and $v_y$ be
any two computation vertices in $V_G$. Then, for any label $l_j \in L_G$, there exists a
path comprised only of edges labelled $l_j$ passing through $v_x$ and $v_y$ such that the
distance from $v_x$ to $v_y$ is d iff $F_{l_j}(v_y) = F_{l_j}(v_x) + d$ and
$\forall l_i \in L_G - \{l_j\}$, $F_{l_i}(v_y) = F_{l_i}(v_x)$.

Henceforth, throughout the rest of this paper G will denote a cube graph. A cube graph
is an object in Euclidean 3-space and we will refer to the 3 axes as the $l_1^{st}$,
$l_2^{nd}$ and $l_3^{rd}$ axes. $h_1 \geq 1$, $h_2 \geq 1$ and $h_3 \geq 1$ are the maximum
dimensions along the $l_1^{st}$, $l_2^{nd}$ and $l_3^{rd}$ axes respectively. If $v_x$ is a
computation vertex in a cube graph then we will denote $F_{l_1}(v_x)$, $F_{l_2}(v_x)$ and
$F_{l_3}(v_x)$ by $x_{l_1}$, $x_{l_2}$ and $x_{l_3}$ respectively. Let $v_0$ denote the
vertex whose coordinates are $\langle 0,0,0 \rangle$.
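
To make the definition concrete, the following Python sketch builds the computation vertices and labelled edges of a cube graph directly from its coordinates. It is a minimal illustration under stated assumptions: the function name `cube_graph` is hypothetical, the vertex set is taken to be the full box $J_1 \times J_2 \times J_3$ (the definition only requires a subset), and source and sink vertices are omitted.

```python
from itertools import product


def cube_graph(h1, h2, h3):
    """Return (vertices, edges) for a box-shaped cube graph.

    vertices: set of coordinate triples <x_l1, x_l2, x_l3>
    edges:    dict mapping a label index j (0, 1, 2 for l1, l2, l3) to a list of
              (tail, head) pairs; an edge labelled l_j advances coordinate j by one.
    """
    dims = (h1, h2, h3)
    vertices = set(product(*(range(h + 1) for h in dims)))
    edges = {j: [] for j in range(3)}
    for v in vertices:
        for j in range(3):
            if v[j] < dims[j]:  # boundary vertices would connect to sink vertices instead
                head = v[:j] + (v[j] + 1,) + v[j + 1:]
                edges[j].append((v, head))
    return vertices, edges


vertices, edges = cube_graph(2, 1, 1)
print(len(vertices), {j: len(e) for j, e in edges.items()})
```

Two vertices lie on a common path of edges labelled $l_j$ exactly when they agree on the other two coordinates, which is the condition stated in the definition.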

3.  Mapping  Cube  Graphs  on  Arrays  and  Trees 

Intuitively, a mapping of G onto an array or tree machine assigns each computation
vertex of G to a processor in the machine at a particular time step and also fixes the
delay and neighborhood constant for every label in $L_G$. Assuming discrete time steps, let
$T = \{0, 1, 2, \ldots\}$ be the sequence of natural numbers representing the progress of a
computation from its start at time 0.


Definition 3.1: A mapping of G onto a linear, rectangular, hexagonal or tree machine
is a 4-tuple $\langle PA, TA, NA, DA \rangle$ where:

1. $T_{Ar} = L_G$.

2. $PA: V_G \to I$ and $TA: V_G \to T$ are many-one functions mapping computation
vertices onto processors and time steps respectively.

3. Let $I^+$ be a set of positive non-zero integers. $NA: L_G \to \{1, -1, 0\}$ and
$DA: L_G \to I^+$ are many-one functions assigning neighborhood constants and
delay constants respectively.

[Note: $NA(l_j) = n_{l_j}$ and $DA(l_j) = d_{l_j}$]

We  next  formalize  a  correct  mapping. 

Definition 3.2: A mapping is syntactically correct iff

1. $\forall l_j \in L_G$ and for any pair of computation vertices $v_x$ and $v_y$, if
there is an edge labelled $l_j$ directed from $v_x$ to $v_y$, then $PA(v_y)$ is the
neighbor of $PA(v_x)$ with respect to label $l_j$ and $TA(v_y) = TA(v_x) + d_{l_j}$.

2. No two values appear simultaneously at the same input port of any processor.

3.1.  Linear  Array  Mapping 

We now describe the algorithm to map G onto a linear array LAr. We begin by developing
some appropriate terminology for describing the algorithm.

Let $w_L = \langle w_1, w_2, w_3 \rangle$ be a triple where $w_1 = 1$,
$w_2 \in \{1, -1\}$ and $w_3 \in \{1, -1\}$.

Definition 3.3: A linear diagonalization $D_L$ of a cube graph is a pair
$\langle D, w_L \rangle$ with the following properties.

1. $D = \{D_1, D_2, \ldots, D_k\}$ is a family of sets of computation vertices and
$\bigcup_{p=1}^{k} D_p = V_G$.

2. $\forall D_p \in D$, if $v_x$ and $v_y$ are in $D_p$ then
$w_1 x_{l_1} + w_2 x_{l_2} + w_3 x_{l_3} = w_1 y_{l_1} + w_2 y_{l_2} + w_3 y_{l_3}$.

3. $\forall D_p \in D$ and $\forall D_q \in D$, $p < q$ iff $\forall v_x$ in $D_p$ and
$\forall v_y$ in $D_q$, $\sum_{i=1}^{3} w_i x_{l_i} < \sum_{i=1}^{3} w_i y_{l_i}$.

We will refer to $w_L$ as the linear diagonalization factor of a cube graph and to any
$D_p \in D$ as a linear diagonal. If $v_x$ is in $D_p$ then we will refer to
$\sum_{i=1}^{3} w_i x_{l_i}$ as the weight of $D_p$.

We assign consecutive indices to the diagonals in D in increasing order of their weights
with the diagonal having the least weight assigned index 1.
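
As an illustration of a linear diagonalization, the Python sketch below groups the vertices of a box-shaped cube graph by weight and indexes the groups from 1 in increasing order of weight. The function name `linear_diagonals` and the representation of vertices as coordinate triples are our own assumptions, not notation from the paper.

```python
from collections import defaultdict
from itertools import product


def linear_diagonals(h, w):
    """Group the vertices of the cube graph with dimensions h = (h1, h2, h3)
    into linear diagonals for the diagonalization factor w = (w1, w2, w3)."""
    by_weight = defaultdict(list)
    for x in product(*(range(hi + 1) for hi in h)):
        weight = sum(wi * xi for wi, xi in zip(w, x))
        by_weight[weight].append(x)
    # The diagonal with the least weight receives index 1.
    return {p: sorted(by_weight[wt]) for p, wt in enumerate(sorted(by_weight), start=1)}


diagonals = linear_diagonals(h=(2, 1, 1), w=(1, 1, -1))
sizes = [len(diagonals[p]) for p in sorted(diagonals)]
print(sizes)
```

For a cube graph with $h_1 = 2$, $h_2 = 1$, $h_3 = 1$ and factor $\langle 1, 1, -1 \rangle$ this produces five diagonals, so five processors suffice.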

Algorithm 

We  are  now  in  a  position  to  describe  the  linear  array  mapping  algorithm. 

1. Perform a linear diagonalization $D_L = \langle D, w_L \rangle$ of the cube graph. For
every $D_p \in D$ assign a processor indexed p.

2. Choose $n_{l_1} = w_1$, $n_{l_2} = w_2$ and $n_{l_3} = w_3$. This fixes the
neighborhood constants of the labels.

3. Choose $d_{l_1} = 1$. If $n_{l_2} = 1$ then choose $d_{l_2} = 2$ else choose
$d_{l_2} = 1$. Choose $d_{l_3}$ as follows.

If $n_{l_2} = 1$ then if $h_1 - h_2 + n_{l_3} \geq 0$ then choose
$d_{l_3} = h_1 + 1 + 2n_{l_3}$ else choose $d_{l_3} = h_2 + 1 + n_{l_3}$.

If $n_{l_2} = -1$ then if $h_2 - h_1 + n_{l_3} > 0$ then choose
$d_{l_3} = 2h_2 + 1 + n_{l_3}$ else choose $d_{l_3} = 2h_1 + 1 - n_{l_3}$.

4. Map vertices in $D_i$ onto processor i, that is, $\forall v_x$ in $D_i$, let
$PA(v_x) = i$.

5. Let $TA(v_x) = \sum_{i=1}^{3} x_{l_i} d_{l_i} + t_1$ where $TA(v_0) = t_1$.
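
The five steps can be sketched in Python as follows. This is a hedged illustration rather than a definitive implementation: the function names are hypothetical, the cube graph is assumed to fill the whole box of coordinates, the delay rule of step 3 is coded as we read it from the text, and condition 2 of syntactic correctness is approximated by checking that no processor is asked to evaluate two vertices in the same time step.

```python
from itertools import product


def linear_delays(h, n):
    """Step 3: delay constants (d_l1, d_l2, d_l3), as read from the text."""
    h1, h2, _ = h
    _, n2, n3 = n
    d2 = 2 if n2 == 1 else 1
    if n2 == 1:
        d3 = h1 + 1 + 2 * n3 if h1 - h2 + n3 >= 0 else h2 + 1 + n3
    else:
        d3 = 2 * h2 + 1 + n3 if h2 - h1 + n3 > 0 else 2 * h1 + 1 - n3
    return (1, d2, d3)


def linear_mapping(h, w, t1=0):
    """Steps 1, 2, 4 and 5: return (PA, TA, n, d) for a box-shaped cube graph."""
    n = tuple(w)                       # neighborhood constants equal the weights
    d = linear_delays(h, n)
    verts = list(product(*(range(hi + 1) for hi in h)))
    weights = {x: sum(wi * xi for wi, xi in zip(w, x)) for x in verts}
    least = min(weights.values())
    PA = {x: weights[x] - least + 1 for x in verts}   # weights are consecutive integers
    TA = {x: t1 + sum(xi * di for xi, di in zip(x, d)) for x in verts}
    return PA, TA, n, d


def syntactically_correct(h, PA, TA, n, d):
    """Check the edge condition and the absence of processor/time collisions."""
    for x in PA:
        for j in range(3):
            if x[j] < h[j]:
                y = x[:j] + (x[j] + 1,) + x[j + 1:]
                if PA[y] != PA[x] + n[j] or TA[y] != TA[x] + d[j]:
                    return False
    return len({(PA[x], TA[x]) for x in PA}) == len(PA)


PA, TA, n, d = linear_mapping(h=(2, 1, 1), w=(1, 1, -1))
print(d, syntactically_correct((2, 1, 1), PA, TA, n, d))
```

The collision check matters because several vertices share a processor; the delay constants are what keep their evaluation times apart.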


3.2.  Mesh  Array  Mapping 

We  next  describe  the  algorithm  to  map  G  onto  a  mesh  array  MAr. 

Let $w_M = \langle w_1, w_2, w_3 \rangle$ be a triple where $w_1 = 1$,
$w_2 \in \{1, -1\}$, and $w_3 = 1$. Let $L_G = L_H \cup L_V$. Let $l_1 \in L_H$ and
$l_3 \in L_V$.

We assign consecutive horizontal indices to the diagonals in increasing order of their
horizontal weights with the diagonal having the least horizontal weight assigned the
horizontal index 1. Similarly, we assign consecutive vertical indices to the diagonals in
increasing order of their vertical weights with the diagonals having the least vertical
weight assigned the vertical index 1.

Algorithm 

We  are  now  in  a  position  to  describe  the  mesh  array  mapping  algorithm. 

1. Perform a mesh diagonalization $D_M = \langle D, w_M \rangle$ of the cube graph. For
every $D_{\langle p,q \rangle} \in D$ assign a processor to the $p^{th}$ row and $q^{th}$
column of a mesh.

2. Choose $n_{l_1} = w_1$, $n_{l_2} = w_2$ and $n_{l_3} = w_3$. This fixes the
neighborhood constants of the labels.

3. Choose $d_{l_1} = 1$, $d_{l_3} = 1$. If $w_2 = 1$ then choose $d_{l_2} = 2$ else choose
$d_{l_2} = 1$.

4. Map vertices in $D_{\langle p,q \rangle}$ onto the processor in the $p^{th}$ row and
$q^{th}$ column, that is, $\forall v_x$ in $D_{\langle p,q \rangle}$, let
$PA(v_x) = \langle p,q \rangle$.

5. Let $TA(v_x) = \sum_{i=1}^{3} x_{l_i} d_{l_i} + t_1$ where $TA(v_0) = t_1$.

3.3.  Hexagonal  Array  Mapping 

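We describe the algorithm to map G onto a hexagonal array HAr. Let
$w_H = \langle w_1, w_2, w_3 \rangle$ be a triple where $w_1 = 1$, $w_2 = 1$ and
$w_3 \in \{1, -1\}$. Let the hexagonal array constant $c \in \{1, -1\}$. Let
$L_G = L_H \cup L_V \cup L_T$ and let $l_1 \in L_H$, $l_2 \in L_V$ and $l_3 \in L_T$.

Definition 3.4: A hexagonal diagonalization $D_H$ of a cube graph is a pair
$\langle D, w_H \rangle$ with the following properties.

1. $D = \{D_{\langle 1,1 \rangle}, D_{\langle 1,2 \rangle}, \ldots, D_{\langle m,n \rangle}\}$
is a family of sets of computation vertices and
$D_{\langle 1,1 \rangle} \cup D_{\langle 1,2 \rangle} \cup \cdots \cup D_{\langle m,n \rangle} = V_G$.

2. For any $D_{\langle p,q \rangle} \in D$, if $v_x$ and $v_y$ are in
$D_{\langle p,q \rangle}$ then $w_1 x_{l_1} + w_3 x_{l_3} = w_1 y_{l_1} + w_3 y_{l_3}$ and
$w_2 x_{l_2} + w_3 x_{l_3} c = w_2 y_{l_2} + w_3 y_{l_3} c$.

3. $\forall D_{\langle p,q \rangle} \in D$ and $\forall D_{\langle r,s \rangle} \in D$,
$p < r$ iff $\forall v_x$ in $D_{\langle p,q \rangle}$ and $\forall v_y$ in
$D_{\langle r,s \rangle}$, $w_1 x_{l_1} + w_3 x_{l_3} < w_1 y_{l_1} + w_3 y_{l_3}$.
Similarly, $q < s$ iff $w_2 x_{l_2} + w_3 x_{l_3} c < w_2 y_{l_2} + w_3 y_{l_3} c$.

We will refer to $w_H$ as the hexagonal diagonalization factor of a cube graph and to
any $D_{\langle p,q \rangle} \in D$ as a hexagonal diagonal. If $v_x$ is in
$D_{\langle p,q \rangle}$ then we will refer to $w_1 x_{l_1} + w_3 x_{l_3}$ as the
horizontal weight and $w_2 x_{l_2} + w_3 x_{l_3} c$ as the vertical weight of
$D_{\langle p,q \rangle}$ respectively. p and q will denote the horizontal and vertical
indices respectively.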

We  assign  consecutive  horizontal  indices  to  the  diagonals  in  increasing  order  of 
their  horizontal  weights  with  the  diagonal  having  the  least  horizontal  weight  assigned 
the  horizontal  index  1.  Similarly,  we  assign  consecutive  vertical  indices  to  the  diagonals 
in  increasing  order  of  their  vertical  weights  with  the  diagonals  having  the  least  vertical 
weight  assigned  the  vertical  index  1 . 

Algorithm 

We  now  describe  the  hexagonal  array  mapping  algorithm. 

1. Perform a hexagonal diagonalization $D_H = \langle D, w_H \rangle$ of the cube graph.
For every $D_{\langle p,q \rangle} \in D$ assign a processor to the $p^{th}$ row and
$q^{th}$ column of a mesh.

2. Choose $n_{l_1} = w_1$, $n_{l_2} = w_2$ and $n_{l_3} = w_3$. This fixes the
neighborhood constants of the labels.

3. Choose $d_{l_1} = 1$, $d_{l_2} = 1$ and $d_{l_3} = 1$.

4. Map vertices in $D_{\langle p,q \rangle}$ onto the processor in the $p^{th}$ row and
$q^{th}$ column, that is, $\forall v_x$ in $D_{\langle p,q \rangle}$, let
$PA(v_x) = \langle p,q \rangle$.

5. Let $TA(v_x) = \sum_{i=1}^{3} x_{l_i} d_{l_i} + t_1$ where $TA(v_0) = t_1$.
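
The hexagonal mapping can be sketched in the same style. The Python code below is an illustration under our own assumptions: the function names are hypothetical, the cube graph is taken to fill the whole box of coordinates, horizontal and vertical indices are obtained by ranking the two weights, and the check verifies the link pattern of the hexagonal array together with the absence of processor/time collisions.

```python
from itertools import product


def hexagonal_mapping(h, w, c, t1=0):
    """Return (PA, TA) for a box-shaped cube graph with dimensions h,
    diagonalization factor w = (w1, w2, w3) and hexagonal array constant c."""
    w1, w2, w3 = w
    verts = list(product(*(range(hi + 1) for hi in h)))

    def horizontal(x):
        return w1 * x[0] + w3 * x[2]

    def vertical(x):
        return w2 * x[1] + w3 * x[2] * c

    # The least weight receives index 1 in each direction.
    h_index = {wt: p for p, wt in enumerate(sorted({horizontal(x) for x in verts}), start=1)}
    v_index = {wt: q for q, wt in enumerate(sorted({vertical(x) for x in verts}), start=1)}
    PA = {x: (h_index[horizontal(x)], v_index[vertical(x)]) for x in verts}
    TA = {x: t1 + sum(x) for x in verts}   # all three delay constants are 1
    return PA, TA


def respects_hexagonal_links(h, w, c, PA, TA):
    """Edges labelled l1, l2, l3 must lead to <p+n,q>, <p,q+n> and <p+n,q+n*c>."""
    n1, n2, n3 = w
    offsets = [(n1, 0), (0, n2), (n3, n3 * c)]
    for x in PA:
        p, q = PA[x]
        for j, (dp, dq) in enumerate(offsets):
            if x[j] < h[j]:
                y = x[:j] + (x[j] + 1,) + x[j + 1:]
                if PA[y] != (p + dp, q + dq) or TA[y] != TA[x] + 1:
                    return False
    return len({(PA[x], TA[x]) for x in PA}) == len(PA)


PA, TA = hexagonal_mapping(h=(2, 1, 1), w=(1, 1, -1), c=-1)
print(PA[(0, 0, 0)], respects_hexagonal_links((2, 1, 1), (1, 1, -1), -1, PA, TA))
```

Because every delay constant is 1, the time of a vertex is simply the sum of its coordinates offset by $t_1$.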

3.4.  Mapping  on  Tree  Machines 

Unlike in the linear-array mapping, we are required to constrain the choice of
$w_1$, $w_2$ and $w_3$. Let
$\langle w_1, w_2, w_3 \rangle \in \{\langle 1,1,1 \rangle, \langle 1,-1,-1 \rangle\}$. A
linear diagonalization is performed on the cube graph before it is mapped onto the tree
machine. The first four steps involved in mapping a cube graph onto a tree machine are the
same as the first four steps in mapping cube graphs onto linear arrays. An additional step
is involved for fixing the perturbation delays as follows. Let p be a processor index in
the tree. (Recall that indexing is done by a depth-first traversal of the tree.)
case 1: If $\langle w_1, w_2, w_3 \rangle = \langle 1,1,1 \rangle$ then
$\Delta(l_1, p) = \Delta(l_2, p) = \Delta(l_3, p)$.
case 2: If $\langle w_1, w_2, w_3 \rangle = \langle 1,-1,-1 \rangle$ then
$\Delta(l_1, p) = -\Delta(l_2, p+1) = -\Delta(l_3, p+1)$.

The final step involves fixing the times at which the vertices are mapped. Let
$v_x \in D_p$. Then
$TA(v_x) = t_1 + \sum_{i=1}^{3} x_{l_i} d_{l_i} + \sum_{j=1}^{p-1} \Delta(l_1, j)$ where
$TA(v_0) = t_1$.
The constraints on the delay perturbations (cases 1 and 2 above) are motivated by
the following discussion. Let T be an arbitrary tree whose vertices are numbered by
some depth-first traversal of the tree as shown in Fig. 3.1. The vertex numbered i will be
referred to as $v_i$. Now replace each edge in the tree by a pair of edges between the two
vertices and consider a closed path in this graph from $v_1$ back to itself that visits all
the vertices in the order $v_1, v_2, \ldots, v_n$ as shown in Fig. 3.2.

Figure 3.1

Such a path is composed of forward edges (those encountered while traversing from $v_i$ to
$v_j$, $i < j$) and reverse edges (those used to backtrack over previously visited
vertices). Each reverse edge is assumed to have a constant delay d associated with it; a
forward edge has a delay ($d_{l_1}$, $d_{l_2}$ or $d_{l_3}$) which depends on the label
($l_1$, $l_2$ or $l_3$) of the stream traversing the edge.

In case 1, all the three streams $l_1$, $l_2$ and $l_3$ traverse the closed path mentioned
above. If there are $x_p$ reverse edges in this path between $v_p$ and $v_{p+1}$ (note
$x_p \geq 0$), then the effective delay for a stream labelled $l_j$ in traversing between
$v_p$ and $v_{p+1}$ is $\zeta(l_j, p) = d_{l_j} + x_p d$, corresponding to a delay
perturbation $x_p d$. Note that the perturbation delay between $v_p$ and $v_{p+1}$, for
any p, is the same for all labels.

In case 2, elements of stream $l_1$ propagate from $v_1$ to the leaf vertices in a series
of local broadcast steps. An element at $v_p$ is broadcast to all vertices $v_q$, $q > p$,
that are adjacent to $v_p$ in the tree as shown in Fig. 3.3.

Figure 3.3 (legend: broadcast path for $l_1$; forward and reverse edges for $l_2$ and
$l_3$; path followed by $l_2$ and $l_3$: jih ... cba)

The elements encounter a delay $d_{l_1}$ in moving from $v_p$ to $v_q$. Owing to the
depth-first numbering scheme, the difference between the time at which the value of a data
element reaches $v_{p+1}$ and the time at which it reaches $v_p$ is $-(x_p - 1) d_{l_1}$,
where $x_p$ is the number of reverse edges between $v_p$ and $v_{p+1}$. Note however, that
the element does not traverse these reverse edges, but a copy of its value reaches
$v_{p+1}$ by the direct broadcast path. Thus if $x_p = 0$ (i.e., $v_p$ and $v_{p+1}$ are
physically adjacent in the tree) then the element will reach $v_{p+1}$ $d_{l_1}$ cycles
later than it reaches $v_p$; else it will reach $v_{p+1}$ at the same or earlier time than
it reaches $v_p$. The effective delay encountered between $v_p$ and $v_{p+1}$ is
$\zeta(l_1, p) = -(x_p - 1) d_{l_1}$, corresponding to a perturbation
$\Delta(l_1, p) = -x_p d_{l_1}$.

Elements of streams $l_2$ and $l_3$ traverse a closed path around the tree as before,
but in the direction opposite to that in case 1, that is, in the direction
$v_n, v_{n-1}, \ldots, v_1$. The effective delay for either of these streams (say $l_2$)
between $v_{p+1}$ and $v_p$ is $d_{l_2} + x_p d$, corresponding to a perturbation
$\Delta(l_2, p+1) = x_p d = \Delta(l_3, p+1)$. The conditions in case 2
can be satisfied by choosing $d = d_{l_1}$.
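
The quantity $x_p$ used in both cases can be computed directly from the tree. The Python sketch below is a minimal illustration under stated assumptions: the tree is given as a hypothetical `parent` dictionary, the depth-first numbering is taken to be a preorder numbering (so every parent has a smaller index than its children), and the example tree is made up for the purpose of the illustration.

```python
def reverse_edge_counts(parent):
    """parent[i] is the index of the parent of vertex i (None for the root v_1).
    Returns x[p], the number of reverse edges on the closed path between
    v_p and v_{p+1}, for p = 1 .. n-1."""
    n = len(parent)
    depth = {}
    for v in range(1, n + 1):           # parents are numbered before their children
        depth[v] = 0 if parent[v] is None else depth[parent[v]] + 1
    # Going from v_p to v_{p+1} backtracks to the parent of v_{p+1}.
    return {p: depth[p] - depth[p + 1] + 1 for p in range(1, n)}


def effective_delays(parent, d_label, d_reverse, case):
    """Case 1: a stream on the closed path sees d_label + x_p * d_reverse.
    Case 2: the broadcast stream l1 sees (1 - x_p) * d_label."""
    x = reverse_edge_counts(parent)
    if case == 1:
        return {p: d_label + x[p] * d_reverse for p in x}
    return {p: (1 - x[p]) * d_label for p in x}


# A made-up five-vertex tree: v_1 has children v_2 and v_5; v_2 has children v_3 and v_4.
parent = {1: None, 2: 1, 3: 2, 4: 2, 5: 1}
x = reverse_edge_counts(parent)
print(x, effective_delays(parent, 1, 1, case=1), effective_delays(parent, 1, 1, case=2))
```

A zero or negative effective delay in case 2 reflects the remark that the broadcast copy can reach $v_{p+1}$ at the same time as, or earlier than, it reaches $v_p$.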

In the appendix we have shown that the mapping algorithms for tree machines,
linear, mesh and hexagonal arrays correctly map a cube graph.

Recall that the host machine inserts input values and extracts the result values
from the array. We now describe the evaluation of the times at which insertion and
extraction must be done. Also recall that the source vertex represents an initial value
and the sink vertex represents a final value. Without loss of generality, let $v_x$ be the
computation vertex, evaluated at time t, that is connected to a source (sink) vertex by an
edge labelled l. The delays in the links having identical labels are all the same. Hence,
if the distance of the processor (onto which $v_x$ is mapped) from the external input
(output) port is k then the input (output) value represented by the source (sink) vertex
must be inserted (extracted) into (from) the array by the host at time $t - k\,d_l$
($t + k\,d_l$).

We  next  synthesize  three  algorithms  to  illustrate  our  mapping  techniques.  The 
first  involves  synthesis  of  a  novel  linear-array  matrix  multiplication  algorithm  that  we 
first  reported  in  [10].  We  will  then  synthesize  another  matrix  multiplication  algorithm  on 
the  tree  machine.  In  our  final  example  we  will  synthesize  an  algorithm  for  multiplication 
of  band  matrices  on  a  hexagonal  array  that  appeared  in  [5]. 

Example 3.1: Consider multiplication of two dense matrices A and B, where
$A = (a_{ik})$ is a $2 \times 2$ matrix and $B = (b_{kj})$ is a $2 \times 3$ matrix.

A program for computing this multiplication is given by the following recurrence.

$c_{ij}^{(k+1)} = c_{ij}^{(k)} + a_{ik} b_{kj}$, $1 \leq i,k \leq 2$ and $1 \leq j \leq 3$

$c_{ij}^{(1)} = 0$

The data-flow description of this computation is shown in Fig. 3.4.

Figure 3.4

In Fig. 3.4, $p_{ij}$ and $q_{ij}$ denote computation vertices. The horizontal, vertical
and oblique incident edges of $p_{ij}$ are labelled $l_1$, $l_2$ and $l_3$ respectively.
Similarly the horizontal, vertical and oblique outgoing edges of $p_{ij}$ are labelled
$l_1$, $l_2$ and $l_3$ respectively. If the horizontal, vertical and oblique incident edges
of $p_{ij}$ or $q_{ij}$ represent the values a, b and c respectively then the horizontal,
vertical and oblique outgoing edges of $p_{ij}$ or $q_{ij}$ represent the values a, b and
c+ab respectively. In Fig. 3.4, the oblique input edge incident on $p_{ij}$ represents the
value $c_{ij}^{(1)}$, which is 0. The oblique outgoing edge from $q_{ij}$ represents the
final (output) value $c_{ij}^{(3)}$ of $c_{ij}$, i.e., $a_{i1} b_{1j} + a_{i2} b_{2j}$.
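
The accumulation carried along the oblique edges can be written out as a short program. The Python sketch below is only an illustration: the function name `multiply_by_recurrence` and the numeric entries of A and B are our own, chosen to match the $2 \times 2$ by $2 \times 3$ shape of the example.

```python
def multiply_by_recurrence(A, B):
    """Compute C = A * B by repeatedly applying c <- c + a_ik * b_kj,
    starting from c = 0 for every entry."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[0] * cols for _ in range(rows)]      # initial values on the oblique edges
    for k in range(inner):                     # one layer of computation vertices per k
        for i in range(rows):
            for j in range(cols):
                C[i][j] = C[i][j] + A[i][k] * B[k][j]
    return C


# Illustrative data only; any 2x2 and 2x3 matrices would do.
A = [[1, 2], [3, 4]]
B = [[5, 6, 7], [8, 9, 10]]
C = multiply_by_recurrence(A, B)
print(C)
```

Each pass of the outer loop corresponds to one layer of computation vertices: the first pass plays the role of the $p_{ij}$ vertices and the second that of the $q_{ij}$ vertices.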

The graph in Fig. 3.4 is a cube graph as illustrated in Fig. 3.5. The cube graph is
shown without the source and sink vertices for purposes of clarity. The maximum
dimensions along the $l_1^{st}$, $l_2^{nd}$ and $l_3^{rd}$ axes are 2, 1 and 1
respectively, i.e., $h_1 = 2$, $h_2 = 1$ and $h_3 = 1$.

Figure 3.5

We next map this graph onto a linear array using the linear-array mapping algorithm.

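Let $w_L = \langle w_1, w_2, w_3 \rangle = \langle 1, 1, -1 \rangle$. For this choice of $w_L$, the set D of diagonals
is comprised of $D_1=\{q_{11}\}$, $D_2=\{p_{11}, q_{12}, q_{21}\}$, $D_3=\{p_{12}, p_{21}, q_{13}, q_{22}\}$,
$D_4=\{p_{13}, p_{22}, q_{23}\}$ and $D_5=\{p_{23}\}$.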
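
The grouping of vertices into diagonals can be reproduced with a short sketch. The coordinates used here are an assumption, since the text does not spell them out: $p_{ij}$ is placed at $(x_{l_1}, x_{l_2}, x_{l_3}) = (j-1, i-1, 0)$ and $q_{ij}$ at $(j-1, i-1, 1)$. The helper names are hypothetical.

```python
from collections import defaultdict

# Assumed coordinates (not stated in the text): p_ij sits at
# (x_l1, x_l2, x_l3) = (j-1, i-1, 0) and q_ij at (j-1, i-1, 1).
def cube_vertices():
    verts = {}
    for i in (1, 2):
        for j in (1, 2, 3):
            verts[f"p{i}{j}"] = (j - 1, i - 1, 0)
            verts[f"q{i}{j}"] = (j - 1, i - 1, 1)
    return verts

def diagonals(verts, w):
    """Group vertices by weight w . x; index the groups by increasing weight."""
    by_weight = defaultdict(set)
    for name, x in verts.items():
        by_weight[sum(wi * xi for wi, xi in zip(w, x))].add(name)
    return {idx: by_weight[wt]
            for idx, wt in enumerate(sorted(by_weight), start=1)}

D = diagonals(cube_vertices(), (1, 1, -1))
for idx in sorted(D):
    print(idx, sorted(D[idx]))
```

Under the assumed coordinates the five groups printed are the diagonals listed for Example 3.1.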

We use |D|=5 processors indexed from 1 to 5. The neighborhood constants for
labels $l_1$, $l_2$ and $l_3$ are $n_{l_1}=1$, $n_{l_2}=1$ and $n_{l_3}=-1$. The vertices in $D_i$ are mapped onto the
processor indexed i. The delays for the labels $l_1$, $l_2$ and $l_3$ are $d_{l_1}=1$, $d_{l_2}=2$ and
$d_{l_3}=1$. The resulting mapping of the entire cube graph is shown in Fig. 3.6. The time
at which a computation vertex is mapped is indicated by the side of the computation vertex;
for instance, $p_{21}$ is mapped onto processor 3 at time $t_1+2$. If A and B were $n \times n$
matrices then the synthesized algorithm above would require O(n) processors and would
take $O(n^2)$ time steps to compute the result matrix.
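
The processor and time assignment for this example can be tabulated with a small sketch. Two things are assumed here rather than taken from the text: the vertex coordinates ($p_{ij}$ at $(j-1, i-1, 0)$, $q_{ij}$ at $(j-1, i-1, 1)$), and the timing rule $TA(v) = t_1 + \sum_i x_{l_i} d_{l_i}$, which is the Appendix formula for the tree machine with all perturbations set to zero. Times are reported as offsets from $t_1$.

```python
from itertools import product

W = (1, 1, -1)      # weight vector w_L of Example 3.1
DELAY = (1, 2, 1)   # delays d_l1, d_l2, d_l3 of Example 3.1
H = (2, 1, 1)       # h_1, h_2, h_3

def schedule():
    """Return {vertex coordinates: (processor, time offset from t_1)}."""
    points = list(product(*(range(h + 1) for h in H)))
    min_weight = min(sum(w * x for w, x in zip(W, p)) for p in points)
    table = {}
    for p in points:
        weight = sum(w * x for w, x in zip(W, p))
        processor = weight - min_weight + 1           # index of the diagonal
        time = sum(d * x for d, x in zip(DELAY, p))   # assumed: no perturbations
        table[p] = (processor, time)
    return table

table = schedule()
# No processor is asked to evaluate two vertices in the same time step.
assert len(set(table.values())) == len(table)
print(table[(0, 1, 0)])   # coordinates assumed for p_21 -> (3, 2)
```

The final assertion is a sanity check only for this 2×2 by 2×3 instance; it says nothing about larger matrices.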


Figure 3.6

Example 3.2: Consider again multiplication of the two matrices in the previous example.
We will synthesize a tree algorithm for multiplying the two matrices.

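Let $w_L = \langle w_1, w_2, w_3 \rangle = \langle 1, -1, -1 \rangle$. For this choice of $w_L$, the set D of diagonals is
comprised of $D_1=\{q_{21}\}$, $D_2=\{q_{11}, q_{22}, p_{21}\}$, $D_3=\{q_{23}, q_{12}, p_{22}, p_{11}\}$,
$D_4=\{q_{13}, p_{23}, p_{12}\}$ and $D_5=\{p_{13}\}$.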
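
As a worked check of these diagonals, suppose (an assumption; the coordinates are not spelled out in the text) that $p_{ij}$ sits at $(x_{l_1}, x_{l_2}, x_{l_3}) = (j-1, i-1, 0)$ and $q_{ij}$ at $(j-1, i-1, 1)$. The weight of a vertex under $\langle 1,-1,-1 \rangle$ is then

```latex
\begin{align*}
w(p_{ij}) &= (j-1) - (i-1) - 0 = j - i, \\
w(q_{ij}) &= (j-1) - (i-1) - 1 = j - i - 1,
\qquad 1 \le i \le 2,\ 1 \le j \le 3 .
\end{align*}
```

Under this assumption the weights run from $-2$ (attained only by $q_{21}$) to $2$ (attained only by $p_{13}$), which gives five diagonals containing 1, 3, 4, 3 and 1 vertices.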

We use |D|=5 processors indexed from 1 to 5. The neighborhood constants for
labels $l_1$, $l_2$ and $l_3$ are $n_{l_1}=1$, $n_{l_2}=n_{l_3}=-1$. Vertices in $D_i$ are all mapped onto the
processor indexed i. The delays for labels $l_1$, $l_2$ and $l_3$ are $d_{l_1}=1$, $d_{l_2}=1$ and $d_{l_3}=6$.
Let the five-vertex tree representing the tree array be as shown in Fig. 3.7 below.


Figure  3.7 

Since the choice of $n_{l_1}$, $n_{l_2}$ and $n_{l_3}$ satisfies case 2, we choose the delay d along reverse
edges to be equal to $d_{l_1}$. The perturbations in the delay for $l_1$ satisfy $\Delta(l_1,1)=0$,
$\Delta(l_1,2)=0$, $\Delta(l_1,3)=-1$ (there is one reverse edge between $v_3$ and $v_4$) and $\Delta(l_1,4)=-2$.
The perturbations for $l_2$ and $l_3$ satisfy $\Delta(l_2,j)=\Delta(l_3,j)=-\Delta(l_1,j-1)$, j=2,..,5. The
effective delay between logically adjacent processors (the $\delta$'s) is shown in Fig. 3.8 for each
stream. The resulting mapping of the cube graph is also shown in Fig. 3.8. The
time at which a computation vertex is mapped is calculated from the final step of the
mapping algorithm for tree machines and is indicated by the side of the computation
vertex. If A and B were $n \times n$ matrices then the tree algorithm would require O(n) processors
and, interestingly, $O(n^2)$ time steps to compute the result matrix.


(Numbers  on  edges  indicate  the  effective  delay  between  logically 
adjacent  processors  for  the  tree  of  Figure  3.7) 

Figure  3.8 

Example 3.3 Consider multiplication of two band matrices A and B as shown below,
wherein $a_{ij}$ and $b_{ij}$ denote the (i,j)th entries in A and B respectively.
Let C=A×B be the result matrix. The data-flow description in Fig. 3.9 represents the
multiplication A×B. The horizontal, lateral and vertical edges are labelled $l_1$, $l_2$ and $l_3$
respectively. In Fig. 3.9, $v_{ij}^{k+1}$ is the computation vertex at a vertical distance k from
$v_{ij}$. Thus, $v_{21}^{3}$ is the computation vertex at a vertical distance 2 from $v_{21}$. The program
graph in Fig. 3.9 is a cube graph as illustrated in Fig. 3.10. We next map this graph on a
hexagonal array using the hexagonal array mapping algorithm.

Let $w_H = \langle w_1, w_2, w_3 \rangle = \langle 1, 1, -1 \rangle$ and c=1. It can be verified that for this choice
of $w_H$ the set of diagonals D is comprised of $\{ D_{ij} \mid 1 \le i,j \le 4 \}$.

The hexagonal array is comprised of 4 rows and columns of processors which are
identical to the processors used in Example 3.1. $L_H=\{l_1\}$, $L_V=\{l_2\}$ and $L_T=\{l_3\}$. The
neighborhood constants for the labels are $n_{l_1}=n_{l_2}=1$ and $n_{l_3}=-1$. The delays are
$d_{l_1}=d_{l_2}=d_{l_3}=1$. The constant c for the array is 1. Fig. 3.11 illustrates the mapping.


Figure 3.9


Figure 3.10

Conclusion

In  this  paper  we  formalized  linear,  mesh  and  hexagonal  array  processors  suitable  for 
VLSI  implementation.  We  also  presented  a  model  of  a  tree  machine.  We  then  presented 
novel  algorithms  for  dense  matrix  multiplication  on  a  linear  array  and  the  tree  machine. 
We also derived a hexagonal array algorithm for multiplying band matrices. Our linear-array
algorithm for multiplying dense matrices is particularly useful in situations where
the I/O bandwidth is limited, as the algorithm requires only a constant number (three) of
I/O ports for inserting the elements of the A and B matrices and retrieving the result values.
The tree algorithm has the same features as the linear-array algorithm. More importantly,
the tree algorithm is robust to hardware faults in the underlying host.

Appendix

We  first  prove  that  the  mapping  algorithm  for  the  tree  machine  correctly  maps  the 
cube  graph.  We  begin  by  first  showing  that  the  mapping  preserves  the  neighborhood 
constant  of  the  labels. 

Theorem A.1: Let $l \in L_G$ and let $n_l$ and $d_l$ be its neighborhood and delay constants
respectively. If $v_x$ and $v_y$ are a pair of computation vertices with an edge labelled l
directed from $v_x$ to $v_y$ then $PA(v_y)=PA(v_x)+n_l$.

Proof: Let $v_x$ and $v_y$ be the vertices in diagonals $D_p$ and $D_q$ respectively and $w_p$ and $w_q$
be the weights of $D_p$ and $D_q$ respectively. So,

\[ w_1 x_{l_1} + w_2 x_{l_2} + w_3 x_{l_3} = w_p, \quad \text{and} \]

\[ w_1 y_{l_1} + w_2 y_{l_2} + w_3 y_{l_3} = w_q. \]

We will show that the theorem holds for $l=l_1$, as the proofs for $l=l_2$ and $l_3$ are
similar.

Let e be the edge labelled l directed from $v_x$ to $v_y$. From the definition of a cube
graph we obtain $y_{l_1}=x_{l_1}+1$, $y_{l_2}=x_{l_2}$ and $y_{l_3}=x_{l_3}$. Consequently, $w_q-w_p=w_1=1$. Since
the diagonals are indexed in order of their weights, it follows that the index of $D_q$ must be
one more than the index of $D_p$, that is, q=p+1.

The mapping algorithm maps vertices in $D_p$ onto processor p and those of $D_q$ onto
processor $p+w_1$ and hence $PA(v_y)=PA(v_x)+w_1$. Also from the mapping algorithm
$n_{l_1}=w_1$. So the theorem holds for $l_1$. □
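
Theorem A.1 can be spot-checked on the cube graph of Example 3.1. The sketch below assumes that computation vertices occupy the integer points $0 \le x_{l_i} \le h_i$ and that an edge labelled $l_i$ joins each point to its successor along axis i; the function names are hypothetical.

```python
from itertools import product

W = (1, 1, -1)   # w_L from Example 3.1; the mapping sets n_l = w_l
H = (2, 1, 1)    # h_1, h_2, h_3

def processor(x, w=W):
    # Diagonal index: with unit weights the diagonal weights are
    # consecutive integers, and the lowest weight gets index 1.
    lowest = sum(min(0, wi * hi) for wi, hi in zip(w, H))
    return sum(wi * xi for wi, xi in zip(w, x)) - lowest + 1

def edges():
    """Yield (label index, tail, head) for edges between computation vertices."""
    for x in product(*(range(h + 1) for h in H)):
        for axis in range(3):
            if x[axis] < H[axis]:
                y = tuple(c + (1 if a == axis else 0) for a, c in enumerate(x))
                yield axis, x, y

# Edges on which PA(v_y) differs from PA(v_x) + n_l.
violations = [(l, x, y) for l, x, y in edges()
              if processor(y) != processor(x) + W[l]]
print(violations)   # []
```

An empty list means every edge of this small instance moves its value by exactly the neighborhood constant of its label.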


We  next  show  that  the  mapping  preserves  the  delay  constant  of  every  label  /. 


Theorem A.2: Let $l \in L_G$ and let $n_l$ and $d_l$ be its neighborhood and delay constants
respectively. Let $v_x$ and $v_y$ be a pair of vertices with an edge labelled l directed from $v_x$
to $v_y$. If $v_x$ is in diagonal $D_p$ then $TA(v_y)=TA(v_x)+\delta(l,p)$.

Proof: We have to consider the two cases when $n_{l_1}=n_{l_2}=n_{l_3}=1$ and when $n_{l_1}=1$,
$n_{l_2}=n_{l_3}=-1$.

Case 1: $n_{l_1} = n_{l_2} = n_{l_3} = 1$.

Let $v_y \in D_q$ and $l=l_1$ with no loss of generality. From the final step in the mapping
algorithm for the tree machine we obtain:

\[ TA(v_x) = t_1 + \sum_{i=1}^{3} x_{l_i} d_{l_i} + \sum_{j=1}^{p-1} \Delta(l_1,j) \]

\[ TA(v_y) = t_1 + \sum_{i=1}^{3} y_{l_i} d_{l_i} + \sum_{j=1}^{q-1} \Delta(l_1,j) \]

By definition of a cube graph we have $x_{l_2}=y_{l_2}$, $x_{l_3}=y_{l_3}$ and $y_{l_1}=x_{l_1}+1$. From
Theorem A.1 we obtain $PA(v_y)=PA(v_x)+1$, i.e., q=p+1. Therefore,

\[ TA(v_y)-TA(v_x) = d_{l_1} + \sum_{j=1}^{q-1} \Delta(l_1,j) - \sum_{j=1}^{p-1} \Delta(l_1,j)
   = d_{l_1} + \sum_{j=p}^{q-1} \Delta(l_1,j) = d_{l_1} + \Delta(l_1,p) = \delta(l_1,p). \]

Case 2: $n_{l_1}=1$, $n_{l_2}=n_{l_3}=-1$.

If $l=l_1$ then the proof is the same as that used in case 1. Else let $l=l_2$ with no loss of
generality. Again by definition of a cube graph we have
$x_{l_1}=y_{l_1}$, $x_{l_3}=y_{l_3}$ and $y_{l_2}=x_{l_2}+1$. From Theorem A.1 we obtain
$PA(v_y)=PA(v_x)-1$, i.e., q=p-1. So,

\[ TA(v_y)-TA(v_x) = d_{l_2} + \sum_{j=1}^{q-1} \Delta(l_1,j) - \sum_{j=1}^{p-1} \Delta(l_1,j)
   = d_{l_2} - \Delta(l_1,q) = d_{l_2} + \Delta(l_2,q+1) = d_{l_2} + \Delta(l_2,p) = \delta(l_2,p). \]

□
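
The telescoping step in case 1 can be checked numerically. In the sketch below the delay constants and the perturbation values $\Delta(l_1,j)$ are made-up numbers, $t_1$ is taken as 0, and the cube graph has the extents of Example 3.1; it is also assumed that in case 1 the time difference along any label l equals $d_l + \Delta(l_1,p)$, as the "no loss of generality" step suggests.

```python
from itertools import product

H = (2, 1, 1)                          # cube-graph extents, as in Example 3.1
D = (1, 2, 3)                          # hypothetical delays d_l1, d_l2, d_l3
PERTURB = {1: 0, 2: -1, 3: 2, 4: 0}    # hypothetical Delta(l1, j), j = 1..4

def diagonal(x):
    # Case 1: n_l1 = n_l2 = n_l3 = 1, so the weight is x_l1 + x_l2 + x_l3
    # and the diagonal index is that weight plus one.
    return sum(x) + 1

def mapped_time(x):
    # TA(v) with t_1 = 0: weighted coordinates plus Delta(l1, 1..p-1).
    p = diagonal(x)
    return sum(d * c for d, c in zip(D, x)) + sum(PERTURB[j] for j in range(1, p))

def effective_delay(label, p):
    return D[label] + PERTURB[p]       # d_l + Delta(l1, p)

ok = True
for x in product(*(range(h + 1) for h in H)):
    for label in range(3):
        if x[label] < H[label]:
            y = tuple(c + (1 if a == label else 0) for a, c in enumerate(x))
            ok &= mapped_time(y) - mapped_time(x) == effective_delay(label, diagonal(x))
print(ok)
```

Changing the entries of `PERTURB` leaves the result unchanged, which is the point of the telescoping argument.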

We next have to establish that no two values appear simultaneously at the input port of
any processor. The following definition and lemma are useful for proving it.

Definition A.1 For any label $l \in L_G$, a major path labelled l in G is a directed path from
a source vertex to a sink vertex such that all the edges in the path are labelled l.

Lemma A.1: Let $l \in L_G$ and $n_l \in \{1,-1\}$. Let $P_1$ and $P_2$ be two distinct major paths
labelled l in G and let $v_x$ and $v_y$ be the computation vertices adjacent to the source vertices
in $P_1$ and $P_2$ respectively. Let $PA(v_x)=s_1$, $PA(v_y)=s_2$ where $s_1<s_2$. Let
$TA(v_x)=t_1$ and $TA(v_y)=t_2$. If the input/output values represented by the source and
sink vertices of $P_1$ and $P_2$ appear simultaneously at the input port of a processor then

\[ (t_2 - t_1)\,n_l = (s_2 - s_1)\,d_l + n_l \Big( \sum_{j=s_1}^{s_2-1} \Delta(l_1,j) \Big). \]


Proof: Again we need to consider the two cases when $n_{l_1}=n_{l_2}=n_{l_3}=1$ and when
$n_{l_1}=1$, $n_{l_2}=n_{l_3}=-1$.

Case 1: $n_{l_1} = n_{l_2} = n_{l_3} = 1$.

Since $PA(v_x)=s_1$ and $PA(v_y)=s_2$, we have $v_x \in D_{s_1}$ and $v_y \in D_{s_2}$. Assume without loss of
generality that the input values represented by the source vertices of $P_1$ and $P_2$ appear
simultaneously at the input port of processor s. Let $s<s_1<s_2$; the proof will be similar
for other values of s. Let t be the time at which both the values appear at the input
port labelled l of s. The time taken by the input value represented by the source vertex
of $P_1$ to reach the input port labelled l of $s_1$ is $t+\sum_{j=s}^{s_1-1} \delta(l,j)$, which is $TA(v_x)$. Similarly,
the time taken by the input value represented by the source vertex of $P_2$ to reach the
input port labelled l of $s_2$ is $t+\sum_{j=s}^{s_2-1} \delta(l,j)$, which is $TA(v_y)$, and hence,

\[ t_1 = TA(v_x) = t + (s_1 - s)\,d_l + \sum_{j=s}^{s_1-1} \Delta(l_1,j), \quad \text{and} \]

\[ t_2 = TA(v_y) = t + (s_2 - s)\,d_l + \sum_{j=s}^{s_2-1} \Delta(l_1,j), \quad \text{and hence,} \]

\[ t_2 - t_1 = (s_2 - s_1)\,d_l + \sum_{j=s_1}^{s_2-1} \Delta(l_1,j). \]

Since $n_l=1$ by hypothesis, we obtain $(t_2 - t_1)\,n_l = (s_2 - s_1)\,d_l + n_l \big( \sum_{j=s_1}^{s_2-1} \Delta(l_1,j) \big)$.

Case 2: $n_{l_1}=1$, $n_{l_2}=n_{l_3}=-1$.

If $l=l_1$, the same proof as in case 1 holds; else assume $l=l_2$ with no loss of generality.
Here $n_{l_2}=-1$ and $s_2>s_1>s$. As illustrated in the figure below, if the two values have to meet
at s at time t then $t_2<t_1<t$.

Now

\[ t = t_1 + \sum_{j=s+1}^{s_1} \delta(l_2,j) = t_1 + (s_1 - s)\,d_{l_2} + \sum_{j=s+1}^{s_1} \Delta(l_2,j) \]

is the time taken by the input value represented by the source vertex of $P_1$ to reach s, and

\[ t = t_2 + \sum_{j=s+1}^{s_2} \delta(l_2,j) = t_2 + (s_2 - s)\,d_{l_2} + \sum_{j=s+1}^{s_2} \Delta(l_2,j) \]

is the time taken by the input value represented by the source vertex of $P_2$ to reach s.

Since the values meet at s, the time t is the same in both the equations and hence,

\[ t_2 - t_1 = (s_1 - s_2)\,d_{l_2} + \sum_{j=s+1}^{s_1} \Delta(l_2,j) - \sum_{j=s+1}^{s_2} \Delta(l_2,j) \]

\[ = (s_1 - s_2)\,d_{l_2} - \Big( \sum_{j=s+1}^{s_2} \Delta(l_2,j) - \sum_{j=s+1}^{s_1} \Delta(l_2,j) \Big) \]

\[ = (s_1 - s_2)\,d_{l_2} - \sum_{j=s_1+1}^{s_2} \Delta(l_2,j). \]

Since $\Delta(l_2,j) = -\Delta(l_1,j-1)$ we have

\[ t_2 - t_1 = (s_1 - s_2)\,d_{l_2} + \sum_{k=s_1}^{s_2-1} \Delta(l_1,k). \]

Also, as $n_{l_2}=-1$, $(t_2 - t_1)\,n_{l_2} = (s_2 - s_1)\,d_{l_2} + n_{l_2} \big( \sum_{k=s_1}^{s_2-1} \Delta(l_1,k) \big)$. □
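
The algebra of case 2 can be checked numerically. The sketch below draws made-up perturbations $\Delta(l_1,j)$, derives $\Delta(l_2,j) = -\Delta(l_1,j-1)$, computes $t_1$ and $t_2$ from a common meeting time t at processor s, and tests the identity stated in the lemma. The function name `check_case2` and all numeric values are illustrative assumptions.

```python
import random

def check_case2(s, s1, s2, d_l2, t, delta_l1):
    """Return True if the Lemma A.1 identity holds for label l2 (n_l2 = -1)."""
    n_l2 = -1

    def delta_l2(j):
        # Delta(l2, j) = -Delta(l1, j-1), as used in the proof.
        return -delta_l1[j - 1]

    # Both values reach processor s at time t, arriving from s1 and s2.
    t1 = t - (s1 - s) * d_l2 - sum(delta_l2(j) for j in range(s + 1, s1 + 1))
    t2 = t - (s2 - s) * d_l2 - sum(delta_l2(j) for j in range(s + 1, s2 + 1))
    lhs = (t2 - t1) * n_l2
    rhs = (s2 - s1) * d_l2 + n_l2 * sum(delta_l1[k] for k in range(s1, s2))
    return lhs == rhs

random.seed(0)
# Made-up perturbations Delta(l1, j) for j = 0..11.
delta_l1 = {j: random.randint(-3, 3) for j in range(12)}
results = [check_case2(s, s1, s2, d, 100, delta_l1)
           for s in range(1, 4)
           for s1 in range(s + 1, 7)
           for s2 in range(s1 + 1, 11)
           for d in (1, 2, 5)]
print(all(results))
```

The check exercises only the arithmetic identity; it does not model the tree machine itself.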

k— , 

We  next  show  that  the  mapping  ensures  that  no  two  input/output  values  appear 
simultaneously  at  the  input  port  of  any  processor. 

Theorem A.3 Let $l \in L_G$. Let $P_1$ and $P_2$ be two distinct major paths in G labelled l.
The mapping ensures that the input/output values represented by the source/sink vertices
of $P_1$ and $P_2$ never appear simultaneously at the input port labelled l of any processor.


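Proof: Let $v_x$ and $v_y$ be the vertices adjacent to the source vertices in $P_1$ and $P_2$ respectively.
From the mapping algorithm we obtain,

\[ PA(v_y) - PA(v_x) = \Delta P = \sum_{i=1}^{3} k_i n_{l_i}, \quad \text{where } k_i = y_{l_i} - x_{l_i} \text{ and } -h_i \le k_i \le h_i. \]

Let $v_x \in D_p$, $v_y \in D_q$ and $p<q$ with no loss of generality. From the mapping algorithm we
also obtain,

\[ TA(v_y) - TA(v_x) = \Delta T = \sum_{i=1}^{3} k_i d_{l_i} + \sum_{j=1}^{q-1} \Delta(l_1,j) - \sum_{j=1}^{p-1} \Delta(l_1,j)
   = \sum_{i=1}^{3} k_i d_{l_i} + \sum_{j=p}^{q-1} \Delta(l_1,j). \]

Now assume that the input/output values represented by the source/sink vertices of $P_1$
and $P_2$ appear simultaneously at the input port labelled $l_1$ of a processor. By Lemma
A.1 we have,

\[ (\Delta T)\,n_{l_1} = (\Delta P)\,d_{l_1} + n_{l_1} \Big( \sum_{j=p}^{q-1} \Delta(l_1,j) \Big), \]

which is the same as

\[ n_{l_1} \Big( \sum_{i=1}^{3} k_i d_{l_i} \Big) + n_{l_1} \Big( \sum_{j=p}^{q-1} \Delta(l_1,j) \Big)
   = (\Delta P)\,d_{l_1} + n_{l_1} \Big( \sum_{j=p}^{q-1} \Delta(l_1,j) \Big), \]

and hence,

\[ (\Delta P)\,d_{l_1} = n_{l_1} \Big( \sum_{i=1}^{3} k_i d_{l_i} \Big). \qquad (*) \]

We next show that (*) cannot be satisfied.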

1.  Let $n_{l_2}=1$, and so by the mapping algorithm, $d_{l_1}=1$ and $d_{l_2}=2$. $P_1$ and $P_2$
are distinct major paths labelled $l_1$ and so $k_2$ and $k_3$ cannot both be zero.

a.  Let $h_1-h_2+n_{l_3} \ge 0$. So $d_{l_3}=h_1+1+2n_{l_3}$ and (*) reduces to
$k_3(h_1+1+n_{l_3})+k_2=0$. Now $h_1+1+n_{l_3} \ge 1$ and so $k_2 \ne 0$ and $k_3 \ne 0$.
Besides $h_2 \le h_1+n_{l_3}$ and $-h_2 \le k_2 \le h_2$ and so (*) cannot be satisfied.

b.  Let $h_1-h_2+n_{l_3}<0$ and so $d_{l_3}=h_2+n_{l_3}$ and (*) reduces to $k_3 h_2+k_2=0$.
Now $h_2 \ge 1$ and so $k_2 \ne 0$ and $k_3 \ne 0$. Besides $-h_2 \le k_2 \le h_2$ and so (*)
cannot be satisfied.

2.  Let $n_{l_2}=-1$. So $d_{l_1}=1$ and $d_{l_2}=1$.

a.  Let $h_2-h_1+n_{l_3} \ge 0$ and so $d_{l_3}=2h_2+1+n_{l_3}$. So (*) reduces to
$2k_2+k_3(2h_2+1)=0$. As $h_2 \ge 1$, $2h_2+1 \ge 3$ and so $k_2 \ne 0$ and $k_3 \ne 0$.
Besides $-h_2 \le k_2 \le h_2$ and so $-(2h_2+1)<2k_2<2h_2+1$ and so (*) cannot be
satisfied.

b.  Let $h_2-h_1+n_{l_3}<0$ and so $d_{l_3}=2h_1+1-n_{l_3}$. So (*) reduces to
$2k_2+k_3(2h_1+1-2n_{l_3})=0$. Now $1 \le h_2<h_1-n_{l_3}$. So $2h_1+1-2n_{l_3}>1$ and
hence $k_2 \ne 0$ and $k_3 \ne 0$. Besides $-h_2 \le k_2 \le h_2$ and so
$-(2h_1+1-2n_{l_3})<2k_2<2h_1+1-2n_{l_3}$ and hence (*) cannot be satisfied.

Using the inequality relationships between $k_1$, $k_2$, $k_3$ and $h_1$, $h_2$, $h_3$ we can similarly
establish that the two equations

\[ \Delta P\, d_{l_2} = \Big( \sum_{i=1}^{3} k_i d_{l_i} \Big) n_{l_2} \quad \text{and} \quad
   \Delta P\, d_{l_3} = \Big( \sum_{i=1}^{3} k_i d_{l_i} \Big) n_{l_3} \]

cannot be satisfied and hence no two input/output values will appear simultaneously
at the input port of any processor labelled $l_2$ or $l_3$. □
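
For the constants of Example 3.1 the three equations can be checked by exhaustive search. The sketch assumes that the start vertices of two major paths labelled $l_i$ share the same coordinate on axis i (so $k_i = 0$) and differ in at least one other coordinate; the function names are hypothetical.

```python
from itertools import product

N = (1, 1, -1)    # neighborhood constants of Example 3.1
D = (1, 2, 1)     # delay constants of Example 3.1
H = (2, 1, 1)     # h_1, h_2, h_3

def star_holds(k, label):
    """Test (Delta P) * d_l == n_l * sum_i k_i * d_li for the given label index."""
    delta_p = sum(ki * ni for ki, ni in zip(k, N))
    return delta_p * D[label] == N[label] * sum(ki * di for ki, di in zip(k, D))

def offending_offsets(label):
    """Nonzero offsets k between start vertices that would satisfy the equation."""
    ranges = [range(-h, h + 1) for h in H]
    ranges[label] = range(0, 1)   # assumed: both paths start at the same coordinate here
    return [k for k in product(*ranges) if any(k) and star_holds(k, label)]

for label in range(3):
    print(label + 1, offending_offsets(label))
```

An empty list for each label agrees with the theorem for this instance; the search covers only these particular constants and is not a substitute for the case analysis.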


The proof that the linear-array mapping algorithm correctly maps a cube graph onto a
linear array follows immediately from the proof of correctness of mapping cube graphs
onto tree machines by setting the perturbations $\Delta$ to zero in the above proofs.

It can be easily established that if $v_x$ and $v_y$ are two computation vertices connected
by an edge labelled l then the mesh-array mapping algorithm maps the vertices
onto processors which are on the same horizontal row if $l \in L_H$ (like processors 11, 12 and
13 in Fig. 2.2) or on the same vertical column if $l \in L_V$ (like processors 11, 21 and 31 in
Fig. 2.2).

It can be similarly established that the hexagonal-array mapping algorithm maps
the two vertices on the same row of processors aligned in a north-easterly direction (like
processors 11, 12 and 13 in Fig. 2.3) if $l \in L_H$. If $l \in L_V$ they are mapped on a row of processors
aligned in a north-westerly direction (like processors 11, 21 and 31 in Fig. 2.3) and
if $l \in L_T$ the vertices are mapped on the same column of processors (like processors 21
and 12 in Fig. 2.3). All these rows and columns constitute a linear array and hence the
correctness proof used above can be used to establish that the mesh and hexagonal-array
mapping algorithms also map cube graphs correctly.